Skip to content

fix(msmarco): address PR #46 review comments and code review findings#52

Open
ashwin2002 wants to merge 1 commit into
mainfrom
sparse_embedding_enhancements
Open

fix(msmarco): address PR #46 review comments and code review findings#52
ashwin2002 wants to merge 1 commit into
mainfrom
sparse_embedding_enhancements

Conversation

@ashwin2002
Copy link
Copy Markdown
Contributor

Step-based partitioning and correctness fixes for loadMSMARCODataset:

  • Replace flat even-split with MSMARCOEmbeddingProduct.getSteps() loop, matching CLI (MSMARCOLoader) and REST SIFT (loadSIFTDataset) behaviour
  • Add start_offset bounds check against STEPS so out-of-range values fail fast with IllegalArgumentException instead of AIOOBE
  • Add end_offset bounds check against steps[last] to prevent AIOOBE in the outer while loop when caller sends value beyond dataset size
  • Add end_offset <= start_offset guard to replace the deleted totalDocs <= 0 check, restoring the silent no-op diagnostic
  • Cap workers per range with effectivePool = min(poolSize, end-start) to prevent step=0 integer division that silently created zero-range workers when doc count was smaller than processConcurrency
  • Stage all WorkLoadGenerate instances in a local map before flushing to loader_tasks so a mid-loop failure leaves no orphaned tasks
  • Remove redundant ws.embeddingFilePath post-construction assignment already set by the WorkLoadSettings constructor
  • Strip trailing whitespace from three blank lines in SDKClientPool introduced by the PR

Used Claude Code for code generation.
Model used: claude-sonnet-4-6.

Step-based partitioning and correctness fixes for loadMSMARCODataset:
- Replace flat even-split with MSMARCOEmbeddingProduct.getSteps() loop,
  matching CLI (MSMARCOLoader) and REST SIFT (loadSIFTDataset) behaviour
- Add start_offset bounds check against STEPS so out-of-range values
  fail fast with IllegalArgumentException instead of AIOOBE
- Add end_offset bounds check against steps[last] to prevent AIOOBE
  in the outer while loop when caller sends value beyond dataset size
- Add end_offset <= start_offset guard to replace the deleted
  totalDocs <= 0 check, restoring the silent no-op diagnostic
- Cap workers per range with effectivePool = min(poolSize, end-start)
  to prevent step=0 integer division that silently created zero-range
  workers when doc count was smaller than processConcurrency
- Stage all WorkLoadGenerate instances in a local map before flushing
  to loader_tasks so a mid-loop failure leaves no orphaned tasks
- Remove redundant ws.embeddingFilePath post-construction assignment
  already set by the WorkLoadSettings constructor
- Strip trailing whitespace from three blank lines in SDKClientPool
  introduced by the PR

Used Claude Code for code generation.
Model used: claude-sonnet-4-6.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Addresses prior PR #46 review feedback by aligning the REST loadMSMARCODataset() partitioning with the CLI MSMARCOLoader step-based scheme, hardening offset validation, fixing a divide-by-zero edge case when doc count is smaller than the pool, and staging tasks atomically before publishing to the global loader_tasks map. Also drops a redundant embeddingFilePath assignment and cleans trailing whitespace in SDKClientPool.

Changes:

  • Replace the flat even-split with a MSMARCOEmbeddingProduct.getSteps() overlap loop and add start_offset/end_offset/empty-range bounds checks that throw IllegalArgumentException early.
  • Cap workers via effectivePool = min(poolSize, end-start), stage WorkLoadGenerate instances in a local map and flush via putAll, and remove the duplicate ws.embeddingFilePath assignment.
  • Whitespace-only cleanup in SDKClientPool.release_client().

Reviewed changes

Copilot reviewed 1 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/main/java/RestServer/TaskRequest.java Rewrites MSMARCO partitioning to step-loop form, adds offset validation, atomic task staging, new task-name composition.
src/main/java/couchbase/sdk/SDKClientPool.java Strips trailing whitespace from three blank lines in release_client().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1061 to +1062
String task_name = "MSMARCOTask_" + TaskRequest.task_id.incrementAndGet() + k + "_" + ws.dr.create_s
+ "_" + ws.dr.create_e;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants